
WIP: Investigate Canon's uniform-integer sampling method #1196

Closed · wants to merge 17 commits

Conversation

@dhardy (Member) commented Oct 20, 2021

See #1172. Also includes #1154 as a comparator.

Some of this work is @TheIronBorn's; I have cleaned up and tweaked his impls a bit, and written new benchmarks using Criterion.

Algorithms to compare:

Of these, Canon's method definitely appears the best all-round method. I'm not done testing yet and may use some tweaks for i128 and some SIMD cases.

Cases:

  • repeated sampling from a distribution, using ranges with low and high probabilities of rejection (picked by @TheIronBorn)
  • single sampling, using the same low and high ranges, and using varying ranges (the upper bound is incremented before each sample)

Research questions

Is sample faster than sample_single?
Yes, definitely (sample_single is approx 50% slower). It's clear so I won't bother with evidence here.

Is it better to use u32 over u64 random numbers when sampling small integers?
For Canon-Lemire, maybe for i8/i16. For Canon, no. For i32, no. Evidence.
(Note: several benches vary by 2.5%, one by 9%, so take with a pinch of salt.)

Note further: the answer may be different on 32-bit CPUs. Tested on 5800X.

Is it better to use u64 instead of u128 samples for 128-bit integers?
Yes (~15% faster for non-SIMD distribution sampling). Evidence.

Which is the best algorithm for XX?
Each case will get a new post below.

CC @ctsrc @stephentyrone

@dhardy dhardy added B-value Breakage: changes output values X-experimental Type: experimental change labels Oct 20, 2021
@dhardy dhardy changed the title Investigate Canon's uniform-integer sampling method WIP: Investigate Canon's uniform-integer sampling method Oct 20, 2021
@dhardy (Member Author) commented Oct 20, 2021

Which algorithm to use for repeated sampling from a distribution? (non-SIMD)

Canon and Canon-Lemire are basically the same, and clear winners for i32 and
i64 (same at i128 except that Bitmask also does similarly well). At i8/i16 and
for low-reject cases, all algorithms are about the same.

For i8 and i32, Canon appears slightly better than Canon-Lemire, though I'm not
convinced the results are significant. In any case, it makes sense to choose
Canon when the two are the same.

Evidence

@dhardy (Member Author) commented Oct 22, 2021

How much bias is acceptable for samples?
This is a slightly trickier question: is using a single u64 sample for i8 / i16 results acceptable (leaving bias of 1-in-2^56 / 1-in-2^48)?
In practice the performance advantage is tiny for discrete samples (from a fixed distribution: 941ps vs 991ps, i.e. 5% faster for i8, 6% for i16), but for SIMD this may be larger.

Results (see Canon vs Canon64)
uniform_dist_int_i8/Old/high_reject                                                                             
                        time:   [1.0550 ns 1.0564 ns 1.0579 ns]
                        change: [-4.8870% -4.7632% -4.6061%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
uniform_dist_int_i8/Old/low_reject                                                                             
                        time:   [1.0613 ns 1.0631 ns 1.0655 ns]
                        change: [-4.2396% -4.0519% -3.7987%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
uniform_dist_int_i8/Lemire/high_reject                                                                             
                        time:   [1.2753 ns 1.2757 ns 1.2762 ns]
                        change: [+0.0946% +0.1332% +0.1760%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  9 (9.00%) high severe
uniform_dist_int_i8/Lemire/low_reject                                                                             
                        time:   [1.2861 ns 1.2864 ns 1.2869 ns]
                        change: [-0.4160% -0.3649% -0.3147%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
uniform_dist_int_i8/Canon/high_reject                                                                             
                        time:   [990.75 ps 990.89 ps 991.05 ps]
                        change: [-5.7790% -5.7415% -5.7063%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
uniform_dist_int_i8/Canon/low_reject                                                                             
                        time:   [982.39 ps 982.87 ps 983.55 ps]
                        change: [-6.6996% -6.6568% -6.6131%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
uniform_dist_int_i8/Canon64/high_reject                                                                             
                        time:   [940.71 ps 941.12 ps 941.56 ps]
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  8 (8.00%) high mild
  1 (1.00%) high severe
uniform_dist_int_i8/Canon64/low_reject                                                                             
                        time:   [941.74 ps 942.07 ps 942.43 ps]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
uniform_dist_int_i8/Canon-Lemire/high_reject                                                                             
                        time:   [992.74 ps 994.00 ps 996.01 ps]
                        change: [-5.4800% -5.4102% -5.3175%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
uniform_dist_int_i8/Canon-Lemire/low_reject                                                                             
                        time:   [998.76 ps 999.10 ps 999.47 ps]
                        change: [-4.3281% -4.2604% -4.1969%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
uniform_dist_int_i8/Bitmask/high_reject                                                                             
                        time:   [1.0390 ns 1.0397 ns 1.0409 ns]
                        change: [-3.4370% -3.1355% -2.6074%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
uniform_dist_int_i8/Bitmask/low_reject                                                                             
                        time:   [835.01 ps 835.25 ps 835.54 ps]
                        change: [-0.3614% -0.2779% -0.2112%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
uniform_dist_int_i16/Old/high_reject                                                                             
                        time:   [1.1080 ns 1.1084 ns 1.1088 ns]
                        change: [+5.1971% +5.3899% +5.6302%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
uniform_dist_int_i16/Old/low_reject                                                                             
                        time:   [1.0954 ns 1.0957 ns 1.0960 ns]
                        change: [+2.0607% +2.2329% +2.3994%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) low severe
  7 (7.00%) high mild
  5 (5.00%) high severe
uniform_dist_int_i16/Lemire/high_reject                                                                             
                        time:   [1.0487 ns 1.0490 ns 1.0492 ns]
                        change: [+0.3726% +0.6616% +0.9021%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
uniform_dist_int_i16/Lemire/low_reject                                                                             
                        time:   [1.0490 ns 1.0494 ns 1.0498 ns]
                        change: [-0.4838% -0.4226% -0.3644%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
uniform_dist_int_i16/Canon/high_reject                                                                             
                        time:   [998.31 ps 998.53 ps 998.78 ps]
                        change: [+0.0661% +0.1725% +0.2543%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
uniform_dist_int_i16/Canon/low_reject                                                                             
                        time:   [991.89 ps 992.23 ps 992.65 ps]
                        change: [-0.2821% -0.2012% -0.1351%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
uniform_dist_int_i16/Canon64/high_reject                                                                             
                        time:   [938.18 ps 938.39 ps 938.62 ps]
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe
uniform_dist_int_i16/Canon64/low_reject                                                                             
                        time:   [937.62 ps 937.98 ps 938.37 ps]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
uniform_dist_int_i16/Canon-Lemire/high_reject                                                                             
                        time:   [983.96 ps 985.47 ps 987.44 ps]
                        change: [-5.7130% -5.5611% -5.3530%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
uniform_dist_int_i16/Canon-Lemire/low_reject                                                                             
                        time:   [999.00 ps 999.33 ps 999.73 ps]
                        change: [-5.0901% -5.0338% -4.9807%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
uniform_dist_int_i16/Bitmask/high_reject                                                                             
                        time:   [864.94 ps 865.76 ps 866.93 ps]
                        change: [+0.4297% +0.7857% +1.2749%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
uniform_dist_int_i16/Bitmask/low_reject                                                                             
                        time:   [845.80 ps 851.63 ps 862.39 ps]
                        change: [+1.6534% +2.0227% +2.5567%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
  

@dhardy (Member Author) commented Mar 2, 2022

So, the first round of analysis (non-SIMD integer types only) is complete.

There may be a second round after a few tweaks to these algorithms. However, most important for now is to answer a few questions regarding bias and use of multiple different algorithms (see Final Thoughts).

@TheIronBorn (Contributor) commented:

This is a very comprehensive report, thank you

@dhardy (Member Author) commented Dec 6, 2022

I didn't finish this (changing the uniform sampler) yet because from the report it's not really clear which sampler to use.

I have also run benchmarks on my laptop CPU (Intel 11th gen) to compare. I apologise if it's not very clear which commit to use to run these benches yourself; I think it's probably 9d0c88a (i.e. this branch before the two temp commits).

It would also be nice if we had some method of determining at compile time the type of RNG used, thus allowing algorithm selection based on the RNG. Unfortunately this isn't really feasible with Rust's current feature set (probably we need specialization and/or generic_const_exprs).

@TheIronBorn (Contributor) commented:

Is it worth picking some good-enough algorithm which solves the #1145 issue?

The over-engineered solution is to add the algorithm selection to the RngCore trait and allow RNGs to specify which to use. That's rather nasty though.

@dhardy (Member Author) commented Jan 14, 2023

I uploaded the Intel 1145 benchmarks here.

Benches should be run from this branch. If you would like to benchmark on a different CPU:

  1. run: cargo bench --bench uniform
  2. copy the results from target/criterion to a new folder in the root and add to git
  3. make a PR against that branch

@dhardy (Member Author) commented Feb 1, 2023

Notes on review of distr_random_* and single_random_* benches across both CPUs:

distr_random_i8:
Decent: all. Differences on Pcg64 and SmallRng are barely significant.
Best is probably Canon32. Lemire and Old are decent on Intel, but less reliably so.

distr_random_i16:
Decent: all. Again, differences are only really significant on two RNGs.
Best: Canon32. Good: Lemire, Old, maybe Canon-reduced.

distr_random_i32:
Decent (low var): Canon, Canon-Lemire, Unbiased, Canon-reduced. In one case (AMD ChaCha8) Canon-reduced has a strong win, otherwise differences are not significant.
High var: Canon32, Lemire, Old. Averages are similar, but variance is much, much higher. Canon32 is worst.

distr_random_i64:
All: high variance
Poor: Canon-Unbiased
Best: Canon-Lemire.

distr_random_i128:
All: high variance
Best on average, but with long tails: Lemire, Old. Also decent: Canon-reduced, Canon-Lemire.
Worst: Canon-Unbiased

single_random_i8:
Good: all. Best: Canon32. Also great in some tests: Canon-reduced, ONeill.
Worst: Canon-Lemire, Unbiased.

single_random_i16:
Good: all. Best: Canon32. Also great in some tests: Canon-reduced, ONeill.
Worst: Canon-Lemire, Unbiased.

single_random_i32:
Best: Canon-reduced. Next best: Canon. Also decent: Unbiased, Canon-Lemire.
Terrible: everything else. Worst: Old.

single_random_i64:
All: high variance
Worst: Old, Unbiased. Poor: ONeill, Canon-Lemire (sometimes).

single_random_i128:
All high variance.
Best: Canon-reduced. Decent: Canon, Canon32.
Bad: Old, Canon-Lemire. Poor: ONeill.

Best for i8: Canon32. Decent: Canon-reduced. Both have max bias 1-in-2^56.
Best for i16: Canon32. Decent: Canon-reduced. Both have max bias 1-in-2^48.
Best for i32: Canon-reduced. But bias is 1-in-2^32. Next best: Canon.
Best for i64: Unclear.
Best for i128: Canon-reduced.

These are all variations of the Canon algorithm (usually with increased max bias):

  • Canon: initially use max(64, size) bits, optionally another max(64, size) bits
  • Canon32: initially use max(32, size) bits, optionally another max(32, size) bits
  • Canon-reduced, i8-i32: use 64 bits once only [should be named "Biased64"]
  • Canon-reduced, i64: use 64 bits initially, optionally another 32 bits
  • Canon-reduced, i128: use 128 bits initially, optionally another 128 bits

Additional variants may be worth exploring:

  • Use a 64-bit initial sample, optionally another 32 bits (i.e. Canon-reduced applied properly to smaller sizes)
  • Use max(32, size) bits initial sample with up to two 32-bit reduction steps
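For reference, the base Canon variant described above can be sketched as follows. This is a hypothetical, simplified illustration (not the code on this branch) for u64 output with a single bias-reduction step; `next` stands in for any source of uniform 64-bit words, and the one-step version retains a tiny residual bias, as discussed in this thread:

```rust
// Sketch of Canon's method (base variant) for u64 output, with one
// optional bias-reduction step. `next` is any source of uniform u64 words.
fn canon_u64<F: FnMut() -> u64>(mut next: F, range: u64) -> u64 {
    // Widening multiply: treat the sample as a 64-bit fixed-point
    // fraction in [0, 1) and multiply it by `range`.
    let product = next() as u128 * range as u128;
    let mut result = (product >> 64) as u64; // integer part: the candidate
    let frac = product as u64;               // fractional part

    // Further random bits can add at most `range - 1` to `frac`; only if
    // that could carry into the integer part do we sample more bits.
    if frac > u64::MAX - range {
        let refine = ((next() as u128 * range as u128) >> 64) as u64;
        let (_, carry) = frac.overflowing_add(refine);
        result += carry as u64; // propagate the carry, if any
    }
    result
}
```

The Canon32 variant simply uses max(32, size)-bit samples in place of the 64-bit ones, and Canon32-2 allows a second reduction step.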

@dhardy (Member Author) commented Feb 1, 2023

Note: rand/src/seq/mod.rs has a little function to force sampling at a lower bit-width where possible since (a) this improves platform independence of results (for usize) and (b) it is noted to be faster.

fn gen_index<R: Rng + ?Sized>(rng: &mut R, ubound: usize) -> usize {
    if ubound <= (core::u32::MAX as usize) {
        rng.gen_range(0..ubound as u32) as usize
    } else {
        rng.gen_range(0..ubound)
    }
}

Maybe uniform samplers for 64 and 128-bit sizes should test the range and potentially use the algorithm for a smaller size?

There are a few issues however: (1) RNG bits used varies by inputs (though this is already non-constant with most algorithms), (2) more complex code with more ifs, (3) may not integrate well with distribution samplers.
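The idea could look roughly like the following hypothetical sketch (names are illustrative). For brevity both paths use a single widening multiply, so each carries a small bias (up to roughly 1-in-2^32 and 1-in-2^64 respectively); a real sampler would add bias-reduction steps as discussed above:

```rust
// Hypothetical dispatch: when a 64-bit range fits in 32 bits, take the
// cheaper 32-bit sampling path. `next` is any source of uniform u64 words.
fn sample_u64<F: FnMut() -> u64>(mut next: F, low: u64, high: u64) -> u64 {
    let range = high - low; // half-open range [low, high), assumed non-empty
    if range <= u32::MAX as u64 {
        // 32-bit widening multiply on a 32-bit word of the sample
        let x = next() as u32;
        low + ((x as u64 * range) >> 32)
    } else {
        // 64-bit widening multiply via u128
        low + ((next() as u128 * range as u128) >> 64) as u64
    }
}
```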

@dhardy (Member Author) commented Feb 2, 2023

I added a new algorithm:

  • Canon32-2: max(32, size) bit sample, optionally followed by one or two 32-bit bias reduction steps

New results:

i8, i16: Canon32-2 is slightly slower than Canon32 (but with less bias). Mostly well ahead of Canon (same bias), but loses in some tests. Therefore Canon32-2 may be the best choice here if we want less bias than 1-in-2^48.

i32: high variance, slower than most, easily beaten by Unbiased. These benches give all 64-bit samplers very low variance and all 32-bit samplers very high variance. At a guess, this is because 64-bit samplers (almost) never need a bias-reduction step therefore the branch predictor is highly effective? Best option remains Canon.

i64: Canon32-2 is 10-20% behind Canon32, sometimes ahead of Unbiased but behind Canon and Canon-Lemire. Looks like Canon-Lemire is the best option here (so the extra cost of the improved bias check is worth it here, but not for i32 output with 64-bit RNG output).

i128: Canon32-2 is well behind Canon-reduced which remains the best choice.

@dhardy (Member Author) commented Feb 2, 2023

I previously wrote (in the report):

It would be nice if we could select an algorithm at compile-time based on whether we are using a 32-bit or 64-bit or block RNG.

Looking again at the results: Pcg64 Canon beats Pcg32 Canon32 when using algorithms sampling 64-bit values and Pcg32 Canon32 wins when sampling 32-bit values, but the differences are usually not large enough to influence the choice of algorithm. Exception: for i8/i16 output, Canon-reduced beats Canon32 on 64-bit RNGs while Canon32 wins on 32-bit RNGs (including ChaCha, which consumes 32-bit words from a block).

Bias: according to my benchmarks, one uniform distribution sample takes (very vaguely) 100ns, allowing about 3e14 samples per CPU-core per year. Bias of 1-in-2^48 samples implies approximately one biased sample per CPU-year. This only affects i8 / u8 output, which I think is acceptable, especially considering this type is unlikely to be used where this type of bias really matters. Possibly we should also have a low/no bias feature flag, but that can be a different issue (i.e. for later/never).
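The arithmetic behind that estimate can be checked quickly (the ~100ns per sample figure is the rough one assumed above):

```rust
// Back-of-envelope check: ~100 ns per sample and bias of 1-in-2^48 gives
// roughly one biased sample per CPU-core-year.
// Returns (samples per CPU-core-year, expected biased samples per year).
fn bias_estimate(ns_per_sample: f64, bias_bits: i32) -> (f64, f64) {
    let ns_per_year = 365.25 * 24.0 * 3600.0 * 1e9; // ≈ 3.16e16 ns
    let samples = ns_per_year / ns_per_sample;      // ≈ 3.16e14 at 100 ns
    let biased = samples / 2f64.powi(bias_bits);    // 2^48 ≈ 2.81e14
    (samples, biased)
}
```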

(Alternatively, Canon32-2 reduces the bias with 8-16-bit output at little cost.)

Therefore I propose we use the algorithms mentioned above: Canon32 for 8-16 bit output, Canon for 32-bit, Canon-Lemire for 64-bit and Canon-reduced for 128-bit. Yes, having so many variants is a little weird, but the differences are mostly small, and not everything reduces well to a single generic implementation anyway.

Regarding SIMD types, I propose to leave these for later.

@vks (Collaborator) commented Feb 2, 2023

Therefore I propose we use the algorithms mentioned above: Canon32 for 8-16 bit output, Canon for 32-bit, Canon-Lemire for 64-bit and Canon-reduced for 128-bit. Yes, having so many variants is a little weird, but the differences are mostly small, and not everything reduces well to a single generic implementation anyway.

How would this affect code size compared to the status quo?

@dhardy (Member Author) commented Feb 2, 2023

LOC? Probably a bit worse. I may be able to make the impls partially share code via macros. Conceptually the differences between the algorithms are small.

Compiled? Generic code is monomorphised at compile time anyway.

@dhardy (Member Author) commented Feb 6, 2023

Note: this branch is behind canon-uniform-benches (which also includes benchmark results). Neither is appropriate for merging.

My analysis above misses something: the Canon-Lemire optimisation is essentially an extra cost up-front (% op) for less rejection from the initial sample. With distributions, this cost is only paid once, therefore probably always worthwhile (though it does use an extra field in the struct). With single-sampling, the benches say it is only useful for 64-bit sampling. Further, this should be applicable to all variants of the Canon method; current benchmarks only include it on the base method.

@dhardy (Member Author) commented Feb 6, 2023

Originally I assumed the Canon-Lemire variant was a mistake, but it was mentioned on Twitter.

The Canon method is based on multiplying a theoretically infinite-precision random value in the range [0, 1) by the range, taking the integer part (i.e. rounding down). In practice, a finite number of random bits are sampled, and, if the fractional part of the result is close enough to 1 that further bits could influence the last integer digit, then more bits must be sampled to avoid bias.

The Lemire method is effectively the familiar sample % range method, rejecting samples which could cause bias.
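That rejection scheme can be sketched minimally for u64 output (a hypothetical illustration; `next` is any source of uniform 64-bit words):

```rust
// Sketch of Lemire's method for u64 output. Samples whose low product
// half falls below `threshold` (the 2^64 % range biased zone) are
// rejected and redrawn; accepted samples are exactly uniform.
fn lemire_u64<F: FnMut() -> u64>(mut next: F, range: u64) -> u64 {
    let threshold = range.wrapping_neg() % range; // == 2^64 % range
    loop {
        let product = next() as u128 * range as u128;
        if (product as u64) >= threshold {
            return (product >> 64) as u64; // unbiased result in [0, range)
        }
        // biased zone: reject and resample
    }
}
```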

It just so happens that the code for the two, up to checking the first threshold, is identical, but with Lemire's method using a tighter threshold. Presumably this is where the idea for Canon-Lemire came from.

Unfortunately, it doesn't work. Lemire rejects some samples to avoid bias; Canon's method does not reject any and assigns exactly 1/range of the theoretical infinite-precision random number to each output. This example demonstrates a whole 2^-64 of that range which Lemire's method assigns to 0 and Canon's method assigns to 1; therefore the Canon-Lemire method is biased.

(A waste of time — shows why I should never have accepted a hand-wavey argument without proof in the first place!)

@vks (Collaborator) commented Feb 6, 2023

Sorry you found out about the bias so late, that's really frustrating. Seems like Canon-Lemire was not sufficiently tested?

@TheIronBorn (Contributor) commented:

Ah sorry, I should have been more rigorous before including it

@dhardy (Member Author) commented Feb 7, 2023

was not sufficiently tested?

Testing for bias is only really practical with small samplers, and all of these samplers are at least 32-bit. Yes, I could build an artificially smaller version, but I didn't get around to it.

@dhardy (Member Author) commented Feb 7, 2023

Some more benchmark results, reduced to five algorithms at each size, chosen from:

  • Lemire: widening multiply with rejection test
  • ONeill: O'Neill's proposed method: https://www.pcg-random.org/posts/bounded-rands.html
  • Canon: widening multiply using max(64, ty-bits) sample, followed by one bias-reduction step at the same sample size where required
  • Canon32: as above, but using max(32, ty-bits) sample sizes
  • Canon32-2: as above, but with up to two bias-reduction steps
  • Canon32-Un[biased]: as Canon32, but with an unlimited number of bias-reduction steps
  • Canon-Un[biased]: as Canon, but with an unlimited number of bias-reduction steps
  • Canon-Red[uced]: Canon's method with max(64, ty-bits) initial sample size and half-width bias-reduction steps
  • Canon-Red-Un: as Canon-Red, but with an unlimited number of bias reduction steps

Canon32-unbiased and Canon-reduced-unbiased are new to this round of testing.

Conclusions

Best for i8, i16: Canon32, or if less bias is wanted, Canon32-2 or Canon32-Un.

Best for i32: Canon or Canon-Un (Biased64 is fastest, but too biased).

Best for i64: Canon, Lemire, ONeill or Canon-Un.

Best for i128: Probably Lemire (distributions only). Otherwise, hard to choose.

Thoughts? I can't think of any more variants; we should probably pick from the above.

We should consider bias: Canon32 samples up to 64 bits and Canon32-2 up to 96, Canon up to 128 (double for i128 output); subtract the output size to get worst-case bias. Probably any of these (except Biased64 for i32) is fine.
